ABSTRACT

We investigate three issues in distributed information retrieval,
considering both TREC data and U.S. Patents: (1) topical organization
of large text collections, (2) collection ranking and selection
with topically organized collections, and (3) results merging,
particularly document score normalization, with topically organized
collections. We find that it is better to organize collections
topically, and that topical collections can be ranked well using either
INQUERY's CORI algorithm or the Kullback-Leibler divergence
(KL), but KL is far worse than CORI for non-topically organized
collections. For results merging, collections organized by topic
require global idfs for the best performance. Contrary to results
found elsewhere, normalized scores are not as good as global idfs
for merging when the collections are topically organized.

1.  INTRODUCTION

We have developed a distributed system for the search and
classification of U.S. patents [11], using INQUERY, a search engine
developed at the Center for Intelligent Information Retrieval at the
University of Massachusetts [3]. Our design choices were guided
by recent research on managing large text collections and
retrieving documents from distributed databases. The performance of
our system led us to question the applicability of these methods to
collections organized by topic, and stimulated the present
research.

Most research on searching distributed collections has focused
upon two issues: (1) collection ranking: ranking collections and
selecting from them a small number to search for a given query,
and (2) results merging: combining the ranked lists of documents
returned from each of the selected collections into a single ranked
list. Our research addresses these and a third important issue, that
of (3) topical organization: the subdivision of data by topic, and
its interaction with collection ranking and results merging.

The advantages of dividing a very large text collection into
smaller collections include faster response time, simplification of
administration, and the possibility of restricting the search to the
best part of the collection. The obvious disadvantage is that one
cannot retrieve documents from collections outside of the
selected, top-ranked set. In spite of this disadvantage, some recent
studies have claimed that given good organization of data,
collection ranking, and results list merging, one can achieve retrieval
performance from distributed databases that approaches that from
a single centralized database [12] [19].

We investigate how best to organize data by comparing retrieval
from collections organized topically with retrieval from
collections organized chronologically or by source, using TREC data
and U.S. Patent data. We investigate the second issue, collection
ranking for topically organized collections, by comparing two
collection ranking algorithms on the TREC and patent
collections. Third, we address results merging for topically organized
collections by comparing four different merging algorithms on
patent and TREC collections under topical and non-topical
organizations.

This is the first collection selection study involving large data sets
that supplements TREC data with another collection. This is
important to avoid bias. Our research is the first to examine retrieval
from topically organized collections that are not subdivided by
clustering, but by a human-designed category scheme of
considerable abstractness and complexity. Our investigation of different
merging algorithms with topically organized data is also unique.

2.  PREVIOUS  RESEARCH 
2.1  Topical  Organization 

We look at three ways of subdividing large corpora: by date, by
source, and by topic. Chronological organization is particularly
appropriate for corpora with a continual influx of new documents,
such as news archives or patents. A new collection can be added
for each week, month, year, etc. Chronologically organized sets of
collections tend to have convenient statistical properties, such as
similar sizes and term frequency distributions. The disadvantage
is that documents relevant to a query may be scattered throughout
the collections, allowing little chance of finding them in a search
restricted to a small number of collections, unless the query
concerns something like a news event which gets most of its coverage
in a narrow time window.

The second common mode of organization is by source, for
example, Associated Press, Wall Street Journal, Federal Register,
etc., which can simulate retrieval from different providers.
Organization by source falls between topical and chronological
organization in that different sources tend to concentrate on
somewhat different content.

Under topical organization, documents about similar subjects are
grouped into the same collection. If this grouping is well done,
most or all of the relevant documents for a query should
potentially be found in one or a small number of collections, according
to van Rijsbergen’s cluster hypothesis that closely associated
documents should be relevant to the same queries [14]. In the
tradition of early work on clustering documents and evaluating
queries against clusters rather than single documents [7] [9], Xu
and Croft [19] have shown that for TREC queries, topical
organization by global clustering does in fact concentrate most relevant
documents into a small number of collections. Xu and Croft
divided TREC collections into 100 subcollections either by source
or by topic using clustering. They found far better retrieval
performance with subcollections divided by topic compared to the
heterogeneous subcollections divided by source. Retrieval from
the best 10 topical subcollections was comparable to centralized
retrieval, whereas the retrieval from the 10 best source-based
subcollections showed 25-30% lower precision than centralized
retrieval.

In creating our distributed patent system, we chose a topical
organization by patent class because each U.S. patent belongs to one
of 400 patent classes. Unlike the TREC clusters, patent classes
are of human design, and are currently in active use by patent
searchers and the USPTO (United States Patent and Trademark
Office). Patents have been manually assigned to the classes
according to extremely abstract criteria. Automatic classification
into patent classes works surprisingly poorly, suggesting that
these groupings are not what one would obtain by clustering.
These data provide a good testbed for generalizing the clustering
results to a topical organization with an extremely different basis.

2.2  Collection  Ranking 

Most collection selection research considers distributed
collections that are autonomous and private. It is assumed to be too
costly to query all the available collections, so a small number
must be selected. Some researchers rely on manually created
characterizations of the collections [4], others require a set of
reference queries or topics with relevance judgements, and select
those collections with the largest numbers of relevant documents
for topics that are similar to the new query [17].

We are interested in the class of approaches including CORI [1],
gGlOSS [6], and others [8] [20], that characterize different
collections using collection statistics like term frequencies. These
statistics, which are used to select or rank the available
collections’ relevance to a query, are usually assumed to be available
from cooperative providers. Alternatively, statistics can be
approximated by sampling uncooperative providers with a set of
queries [2]. In the present study we compare two of these
approaches, CORI and topic modeling.

The distributed patent system uses the CORI net (collection
retrieval information network) approach in INQUERY [1],
described in more detail in section 3.3.1, because this method has
been shown successful in ranking collections, and outperforms
some of the best alternative approaches [5]. Given our topically
organized data, we thought we might get better performance from
the topic modeling approach used by Xu and Croft [19] to rank
their clustered collections.

A topic model is a probability distribution over the items in a
corpus, in this case unigrams. The Kullback-Leibler (KL)
divergence [10], an information theoretic metric used to measure how
well one probability distribution predicts another, was applied to
measure how well a topic model predicts a query or a document.
In Xu and Croft, this topic modeling approach performed better
than CORI on clustered TREC4 data according to two measures:
higher ranking collections contained larger numbers of relevant
documents, and retrieval attained higher precision. It is
significant, however, that the same KL measure was used in creating the
topical clusters. This method of collection selection may be
uniquely suited to selection when the collections have been
organized based on the same KL metric. Xu and Croft’s results
leave open two issues we address here: (1) whether KL is
superior to CORI even when the topical scheme is not tied so closely
to the retrieval metric, and (2) how topic modeling performs when
collections are not organized according to topics.

2.3  Results  Merging 

In a typical distributed retrieval situation, document scores from
different providers may be computed differently, or not be
provided at all. To present the user with a single ranked list, these
lists must be merged into an accurate single ordering. When no
scores are provided, solutions depend only on the ranking of
collections and the number and ordering of documents retrieved from
each collection [16]. When scores are provided, one can attempt
to scale the disparate scores [15] [18]. Even in our relatively
consistent situation where all the document scores are provided by
INQUERY, the differences in the statistical makeup of the
collections present a barrier to an accurate ordering of documents from
different collections. In the typical tf-idf document score
computed in INQUERY and most other systems [3][13], the idf
(inverse document frequency) component is a function of the number
of documents in the collection containing the query terms, so that
identical documents in different collections would receive
different document scores.

One approach, taken by Xu and Croft [19], is to avoid the
problem by using global idfs, i.e. idfs from the full set of documents
in all the collections, in computing document scores. In
INQUERY, we compute normalized document scores which are
scaled using maximum and minimum possible scores to attempt to
make them comparable across collections. Powell et al. [12]
found that TREC document scores could be effectively
normalized this way, yielding retrieval performance as good as that
attained via global idf. However, the document rankings we
obtained from our distributed PTO system suggested that this
normalization was not sufficient for patent data.

When we searched the distributed patent database, we would
often find apparently non-relevant documents at the top of the list
and good documents at lower ranks. In contrast, when we
searched a single database containing two years of patents, we
would get good retrieval results. A closer analysis of the situation
revealed that the collection ranking algorithms were doing a good
job of selecting collections, but that documents from
lower-ranking collections (among the top 10) were outranking documents
from higher ranking collections. Thus the problem was one of
results merging.

Table 1 shows one example of such a pattern, obtained from the
query “Accordion musical instrument.” The ranked list of patent
classes for the query is above, and the ranked list of documents
after merging is below. The number to the left of each document
title indicates the patent class, and hence, the collection, where the
document resides. Merging was based on INQUERY's
normalized document scores.


Class  Class Description
084    Music
381    Signal Processing Systems and Devices
181    Acoustics
446    Amusement devices: Toys
434    Education and demonstration
281    Books, strips, & leaves
369    Dynamic information storage or retrieval

Class  Patent Title
369    Automatic musical instrument playback from digital source
369    Electronic apparatus with magnetic recording device
369    Method and apparatus for restoring aged sound recordings
369    Auto-playing apparatus
369    Disc playing apparatus . . .
369    Subcode info and block ID system for a disc player
381    Microphone pickup system
084    Slender housing for electronic M.I.D.I. accordion
084    Accordion support apparatus
084    Electronic accordion housing and support stand
084    Accordion with new order of sounds

Table  1.  Problem  Query  Example.  Ranked  list  of  classes  and 
patents  for  query  “Accordion  Musical  Instrument”. 

In this example, many patents that mention music, instruments,
and accordions are in the best class for the query, music. In this
class each of these query terms has a relatively low idf. For less
relevant collections, these terms are rare, and hence have higher
idfs, which results in higher document scores for the documents in
the lower-ranked collections. Normalization should compensate
for the disparity, but was not fully successful. This rare term
problem has been noted before [16] [19]. However, in the PTO
situation, the rare term problem is not at all rare. Due to the
skewed term distributions across collections and the short,
specific PTO queries, most query terms are rare terms.
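
The rare term effect can be illustrated with a small numeric sketch. It assumes a generic idf of the form log(N / df), which is not INQUERY's exact formula, and the collection sizes and document frequencies below are invented for illustration only.

```python
import math


def idf(num_docs: int, doc_freq: int) -> float:
    """Generic idf, log(N / df). An assumption, not INQUERY's exact formula."""
    return math.log(num_docs / doc_freq)


# Hypothetical statistics for the query term "accordion" (made-up numbers).
collections = {
    "084 Music": {"num_docs": 5000, "doc_freq": 500},
    "369 Dynamic information storage": {"num_docs": 5000, "doc_freq": 5},
}

# Local idf: computed separately inside each collection.
local_idf = {
    name: idf(stats["num_docs"], stats["doc_freq"])
    for name, stats in collections.items()
}

# Global idf: computed once over the union of all collections.
total_docs = sum(stats["num_docs"] for stats in collections.values())
total_df = sum(stats["doc_freq"] for stats in collections.values())
global_idf = idf(total_docs, total_df)

for name, value in local_idf.items():
    print(f"{name}: local idf = {value:.2f}")
print(f"global idf = {global_idf:.2f}")
```

Under these assumed numbers the term receives a much larger local idf in the collection where it is rare, so a document there can outscore a comparable document in the music collection; with a global idf both documents are weighted identically.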

The failure in the PTO system of normalization methods that were
successful in other distributed systems motivated the merging part
of our research. There is no prior research on merging and
normalization methods for topically organized collections. We
compare several different merging algorithms, with TREC data and
with patent data, organized topically and otherwise.

3.  EXPERIMENTAL  METHOD 

We use two different data sets in this research, which we refer to
as TREC3 and PTO. Their statistics can be seen in Table 2.

3.1  TREC  Data 

The TREC3 data set is the one reported in Xu and
Croft [19]. This set of 741,856 documents was broken up into
100 collections in two ways, by topic and by source. The by-topic
organization was Xu and Croft’s TREC3-100col-global set, in which
the documents were clustered by a two-pass K-means algorithm using
the Kullback-Leibler divergence as the distance metric. The
by-source organization was Xu and Croft’s TREC3-100col-bysource
set. Here, the documents were grouped by source, allocating a
number of collections to each source that was proportional to the
total number of documents from that source. The 50 TREC3
queries were based on TREC topics 151-200.

3.2  PTO  Data 

The PTO data set is made up of virtually all utility and plant
patents from the years 1980 through 1996, which number around 1.4
million. This is about one fourth of all U.S. utility and plant
patents, and comprises 55 gigabytes of text (Table 2). We excluded
design patents, because the content of a design claim is usually an
image rather than a text description. Patents range in size from a few
kilobytes to around 1.5 megabytes. We include the full text of all
of these patents in our collections.

The set has been divided into subcollections in two different ways
for this research. The chrono set is divided chronologically into
401 collections of roughly equal size in terms of numbers of
patents. The by-class set is divided by patent class into 401
subcollections.

There is no standard set of patent queries with relevance
judgments for the patent collection. We constructed 37 queries
covering a range of patent areas, non-technical enough for laymen to
consistently judge the relevance of patents to queries. We had
searched the patent collection at various times in the past to look
for prior art, and some of the queries came from these searches.


                 Size             Avg. Doc Len  Collections  Docs per Collection
Data Set         GB    Num Docs   Words         Number       Avg   Min    Max
PTO by class     55    1,397,860  5586          401          3486  1      34,271
PTO chrono       55    1,397,860  5586          401          3486  3,461  3,486
TREC3 by topic   2.2   741,856    260           100          7418  100    106,782
TREC3 by source  2.2   741,856    260           100          7418  7,294  7,637

Table  2:  Test  Collection  Summary  Statistics 


                       Words per Query   Rel Docs per Query
Data Set  Num Queries  Avg   Min  Max    Avg   Min  Max
PTO       37           3.0   1    7      35    9    68
TREC3     50           34.5  15   58     196   14   1141

Table  3:  Query  Summary  Statistics

Two of the three experimenters judged the relevance of
documents to these queries. We collected the top 30 documents
returned for each query pooled over all the experimental conditions.
This total pool of documents for a given query was judged by a
single experimenter for consistency, in a random order so the
judge would be unaware of which condition(s) retrieved the
document. Because there was a great deal of overlap in the sets of
documents retrieved for a query across the different conditions, an
average of 90 documents were judged per query. Table 3 shows
more information about the queries.

3.3  Distributed  Retrieval 

Retrieval consisted of the following steps in all experimental
conditions:

(1) Rank the collections against the query. The collection
ranking methods are either CORI or KL, described below.

(2) Retrieve the best 30 (for PTO) or 100 (for TREC) documents
from each of the ten top-ranked collections, using the same
algorithm as in INQUERY's single collection retrieval system [3],
modified to make available the maximum and minimum possible
document scores for normalization.

(3) Normalize scores, if appropriate to the experimental condition,
and merge the results lists. The four merging methods are
described in detail below. The baseline method is global idf, and
the other three conditions are normalization techniques we call
norm-both, norm-dbs, and norm-docs. For TREC, we also provide a
centralized retrieval baseline, in which documents are retrieved
from a single large database.
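
The three steps above can be sketched as a small driver function. This is only an illustrative outline, not the INQUERY implementation: `retrieve` and `merge_score` are hypothetical callables standing in for single-collection retrieval and for one of the merging methods, and collection scores are assumed to be "higher is better".

```python
from typing import Callable, Dict, List, Tuple

ScoredDoc = Tuple[str, float]  # (document id, score)


def distributed_search(
    query: str,
    collection_scores: Dict[str, float],
    retrieve: Callable[[str, str, int], List[ScoredDoc]],
    merge_score: Callable[[float, float], float],
    num_collections: int = 10,
    docs_per_collection: int = 30,
) -> List[ScoredDoc]:
    """Rank collections, retrieve from the best ones, and merge the lists.

    collection_scores: precomputed score of each collection for the query
        (assumed here: a higher score means a better collection).
    retrieve(name, query, k): hypothetical single-collection search that
        returns up to k (doc_id, score) pairs.
    merge_score(doc_score, collection_score): hypothetical merging rule
        that returns the final score used for the merged ranking.
    """
    # Step 1: rank the collections and keep the top few.
    ranked = sorted(collection_scores, key=collection_scores.get, reverse=True)
    selected = ranked[:num_collections]

    # Step 2: retrieve the best documents from each selected collection.
    merged: List[ScoredDoc] = []
    for name in selected:
        for doc_id, doc_score in retrieve(name, query, docs_per_collection):
            # Step 3: rescore for merging.
            merged.append((doc_id, merge_score(doc_score, collection_scores[name])))

    # Step 3 (continued): produce one ranked list.
    merged.sort(key=lambda pair: pair[1], reverse=True)
    return merged
```

With `docs_per_collection=30` the call corresponds to the PTO setting and with 100 to the TREC setting; passing a `merge_score` that returns the document score unchanged corresponds to merging on raw scores.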

To address the topical organization issue, we query the patent
collections organized by class and by date, and the TREC
collections organized by topic and by source.

To evaluate retrieval we look at precision at 5, 10, 15, 20, and 30
(and 100 for TREC) documents. We use this measure rather than
the more usual 11-point precision, because of the relatively small
number of relevant documents we have for the PTO queries.
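
Precision at a fixed cutoff can be computed as in the following minimal sketch. The function name and the convention of dividing by the cutoff even when fewer documents are returned are assumptions for illustration, not details taken from the experiments.

```python
from typing import List, Set


def precision_at_k(ranked_doc_ids: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top k ranked documents that are judged relevant.

    Divides by k even if fewer than k documents were returned (an assumed
    convention), so missing documents count as non-relevant.
    """
    top_k = ranked_doc_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    return hits / k


# Toy usage with made-up document ids and judgments.
ranking = ["d1", "d2", "d3", "d4", "d5"]
judged_relevant = {"d1", "d4", "d9"}
for cutoff in (2, 5):
    print(cutoff, precision_at_k(ranking, judged_relevant, cutoff))
```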

3.3.1  CORI  Collection  Ranking 

In the CORI net approach, collection ranking is considered to be
analogous to document ranking. Collections are treated as
pseudo-documents, and ranked according to the following
analogue to tf-idf scores for document retrieval from single
collections [1]. This formulation, for a simple “natural language” query
with no special operators, is as follows:

$$\mathrm{Score}_C = \frac{1}{|Q|} \sum_{j=1}^{|Q|} \left(0.4 + 0.6 \cdot T_j \cdot I_j\right)$$

$|Q|$ is the number of terms in the query, $T_j$ is the tf analogue for
term $j$, that is:

$$T_j = \frac{df_j}{df_j + 50 + 150 \cdot (cw / avg\_cw)}$$

and $I_j$ is the idf analogue for term $j$, that is:

$$I_j = \frac{\log\left((N + 0.5) / cf_j\right)}{\log(N + 1.0)}$$

where $df_j$ is the number of documents in collection $C$ containing
the $j$-th query term, $cw$ is the number of indexing terms in $C$,
$avg\_cw$ is the average number of indexing terms in each
collection, $N$ is the number of collections, and $cf_j$ is the number of
collections containing term $j$.
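
The CORI formulation above can be written as a short function. This is a sketch of the formula only, not INQUERY's code; the function and argument names are illustrative, and it assumes a non-empty query whose terms each occur in at least one collection (so that cf is never zero).

```python
import math
from typing import Sequence


def cori_score(
    df: Sequence[int],
    cw: float,
    avg_cw: float,
    cf: Sequence[int],
    num_collections: int,
) -> float:
    """CORI score of one collection for a query.

    df[j]: documents in this collection containing query term j.
    cw: number of indexing terms in this collection.
    avg_cw: average number of indexing terms per collection.
    cf[j]: number of collections containing query term j (assumed >= 1).
    num_collections: N, the total number of collections.
    """
    total = 0.0
    for df_j, cf_j in zip(df, cf):
        # tf analogue for the term in this collection.
        t_j = df_j / (df_j + 50 + 150 * (cw / avg_cw))
        # idf analogue for the term across collections.
        i_j = math.log((num_collections + 0.5) / cf_j) / math.log(num_collections + 1.0)
        total += 0.4 + 0.6 * t_j * i_j
    return total / len(df)


# Toy usage with invented statistics for a two-term query.
print(cori_score(df=[120, 40], cw=1.0e6, avg_cw=1.2e6, cf=[15, 60], num_collections=401))
```

A collection containing none of the query terms receives the default belief of 0.4 per term, so the score always lies between 0.4 and 1.0.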

3.3.2  KL  Collection  Ranking 

In Xu and Croft's language modeling approach, collections are
ranked by a modification of the Kullback-Leibler (KL) divergence
which measures the distance between a query $Q$ and a collection
$C$:

$$\mathrm{Score}_C = \sum_{w_j:\, f(Q, w_j) \neq 0} \frac{f(Q, w_j)}{|Q|} \log \frac{f(Q, w_j) / |Q|}{\left(f(C, w_j) + f(Q, w_j)\right) / \left(|Q| + |C|\right)}$$

where $f(Q, w_j)$ is the number of occurrences of term $w_j$ in the
query, $|Q|$ is the number of term occurrences in the query, $f(C, w_j)$
is the number of occurrences of the term $w_j$ in the collection, and
$|C|$ is the total number of term occurrences in the collection.
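
A minimal sketch of this KL-based score follows. It assumes the smoothed form in which query counts are added to collection counts, and it treats a smaller value as a closer match, as is usual for a divergence; the function name and the toy term counts are invented for illustration, and the query is assumed to be non-empty.

```python
import math
from collections import Counter
from typing import List


def kl_score(query_terms: List[str], collection_counts: Counter, collection_size: int) -> float:
    """Smoothed KL divergence between a query and a collection.

    query_terms: the query as a non-empty list of term occurrences.
    collection_counts: term -> number of occurrences in the collection.
    collection_size: total number of term occurrences in the collection.
    Smaller values mean the collection predicts the query better.
    """
    query_counts = Counter(query_terms)
    query_len = len(query_terms)
    score = 0.0
    for term, f_q in query_counts.items():
        p_query = f_q / query_len
        # Adding the query counts keeps the collection estimate non-zero.
        p_coll = (collection_counts[term] + f_q) / (collection_size + query_len)
        score += p_query * math.log(p_query / p_coll)
    return score


# Toy usage with invented term counts.
music = Counter({"accordion": 50, "music": 200})
storage = Counter({"disc": 300})
query = ["accordion", "music"]
print(kl_score(query, music, 1000), kl_score(query, storage, 1000))
```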


3.3.3  Normalization  for  Merging

In INQUERY, document scores are normalized based on the
maximum ($D_{max}$) and minimum ($D_{min}$) scores any document could
attain for the query:

$$D_{norm} = \frac{D - D_{min}}{D_{max} - D_{min}}$$

Collection scores are similarly normalized using the maximum
($C_{max}$) and minimum ($C_{min}$) scores a collection could attain for the
query:

$$C_{norm} = \frac{C - C_{min}}{C_{max} - C_{min}}$$

The final ranking score for a document combines the normalized
collection and document scores into a final score for the
document which we call norm-both, because both document and
collection scores are normalized:

$$\text{norm-both:} \quad \mathrm{Score} = \frac{D_{norm} + 0.4 \cdot C_{norm} \cdot D_{norm}}{1.4}$$

Two other normalization methods are variations of the norm-both
approach. Norm-docs simply uses the normalized document
score, without considering any contribution of collection scores.
Norm-dbs combines the raw document score with a normalized
collection score.

$$\text{norm-docs:} \quad \mathrm{Score} = D_{norm}$$

$$\text{norm-dbs:} \quad \mathrm{Score} = \frac{D + 0.4 \cdot C_{norm} \cdot D}{1.4}$$

Norm-dbs  was  of  interest  because  it  was  the  method  in  use  when 
we  first  noticed  the  rare  term  problem  described  above.  It  is  the 
only  one  of  these  three  normalization  methods  that  requires  only  a 
list  of  documents  and  scores.  The  other  methods  require  ideal 
maximum  and  minimum  scores  for  each  query,  which  would  not 
be  available  from  an  uncooperative  provider.  Norm-docs  was 
included  under  the  reasoning  that  perfect  normalization  should 
yield  scores  similar  to  global  idf. 
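
The three normalization-based merging scores can be sketched as follows. The function names are illustrative rather than INQUERY's, and the code assumes the maximum and minimum attainable scores differ, so the denominators are non-zero.

```python
def normalize(score: float, min_score: float, max_score: float) -> float:
    """Min-max normalization against the attainable score range."""
    return (score - min_score) / (max_score - min_score)


def norm_both(d: float, d_min: float, d_max: float,
              c: float, c_min: float, c_max: float) -> float:
    """Both the document and the collection score are normalized."""
    d_norm = normalize(d, d_min, d_max)
    c_norm = normalize(c, c_min, c_max)
    return (d_norm + 0.4 * c_norm * d_norm) / 1.4


def norm_docs(d: float, d_min: float, d_max: float) -> float:
    """Only the normalized document score is used."""
    return normalize(d, d_min, d_max)


def norm_dbs(d: float, c: float, c_min: float, c_max: float) -> float:
    """Raw document score combined with a normalized collection score."""
    c_norm = normalize(c, c_min, c_max)
    return (d + 0.4 * c_norm * d) / 1.4


# Toy usage with invented scores.
print(norm_both(0.7, 0.4, 1.0, 0.9, 0.4, 0.9))
print(norm_docs(0.7, 0.4, 1.0))
print(norm_dbs(0.7, 0.9, 0.4, 0.9))
```

Note that `norm_dbs` needs no per-query document score bounds, which matches the observation that it is the only one of the three usable with just a list of documents and scores.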

Figure 1. TREC precision, CORI vs KL collection ranking,
organization by topic and by source.

Figure 2. PTO precision, CORI vs KL collection ranking,
organization by class and chronological.

Figure 3. TREC distribution of relevant documents in top 50
collections, organization by topic and by source,
CORI vs KL ranking. (Axis label: Number of Collections.)

Figure 4. PTO distribution of relevant documents in top 50
collections, organization by class and chronological,
CORI vs KL ranking.


4.  RESULTS 

4.1  Topical  Organization 

Figure 1 shows that for TREC data, organization by topic gives
better retrieval results than organization by source, replicating Xu
and Croft’s findings for KL collection ranking (open symbols and
dotted lines) and extending these findings to CORI collection
ranking (filled symbols and solid lines). As anticipated, the larger
PTO data set also shows this pattern (Figure 2). This topical
superiority holds for all other methods of result list merging, as we
will illustrate below.


One reason for the topical superiority can be seen in Figure 3 and
Figure 4, which show the distribution of relevant documents in
the top 50 collections as ranked by the CORI and KL algorithms.
The optimal curves represent the case where the collections are
ordered by the actual number of relevant documents in each,
averaged over all queries. This provides an upper bound for
collection ranking algorithms. When collections are organized by topic
(circles in the plots), relevant documents tend to be concentrated
into a small number of collections. When collections are not
organized by topic (squares), relevant documents are more scattered
throughout collections, limiting the number of documents that can
be retrieved from 10 collections.

Interestingly, the advantage for topical organization is much more
pronounced for the PTO data than for the TREC data. This
appears to be both because topical organization is better for PTO
than for TREC and because the non-topical organization is worse
for PTO than for TREC. Relevant documents are more
concentrated into a smaller proportion of PTO by-class collections than
the TREC by-topic collections: the top 10 PTO collections are
only 2.5% of the 400 total collections, but cover 83.7% of the
known relevant documents. The top 10 TREC collections cover
10% of the data but include only 78.5% of the known relevant
documents. On the other hand, chronological organization for
PTO is worse than organization by source for TREC, in that
relevant PTO documents are more evenly spread across collections.
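
The coverage figures discussed here, and the optimal ordering used as an upper bound, can be computed along the following lines. This is a per-query sketch with invented counts; the curves in the figures are averaged over all queries, and the function names are illustrative.

```python
from typing import Dict, List


def optimal_ranking(relevant_per_collection: Dict[str, int]) -> List[str]:
    """Order collections by their actual number of relevant documents."""
    return sorted(relevant_per_collection, key=relevant_per_collection.get, reverse=True)


def coverage_at_n(
    ranked_collections: List[str],
    relevant_per_collection: Dict[str, int],
    n: int,
) -> float:
    """Fraction of known relevant documents held by the top n collections."""
    total = sum(relevant_per_collection.values())
    covered = sum(relevant_per_collection.get(name, 0) for name in ranked_collections[:n])
    return covered / total


# Toy usage: made-up counts of relevant documents for one query.
relevant_counts = {"a": 8, "b": 1, "c": 1, "d": 0}
print(coverage_at_n(optimal_ranking(relevant_counts), relevant_counts, 1))
print(coverage_at_n(["d", "c", "b", "a"], relevant_counts, 2))
```

Comparing the coverage of an algorithm's ranking with that of `optimal_ranking` at the same cutoff shows how far the algorithm falls short of the upper bound.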

4.2  Collection  Ranking  Methods 

The  same  figures  illustrate  the  comparison  of  collection  ranking 
algorithms,  the  Kullback-Leibler  divergence  and  INQUERY’s 
CORI  algorithm,  addressing  the  generality  of  the  claim  that  KL  is 
a  better  way  to  select  topically  organized  collections.  Collections 
were  ranked  either  via  CORI  or  KL.  We  consider  only  global  idf 
here  to  separate  the  collection  ranking  issue  from  that  of  merging. 

We replicated Xu and Croft’s findings that KL yields better
retrieval performance than CORI on topically organized TREC data
(Figure 1). KL retrieval is almost as good as retrieval from a
single centralized collection. However, KL is better than CORI only
on topically organized data. KL performs worse than CORI on
TREC data organized by source.

On the PTO data organized by class (Figure 2), the KL metric
shows at most a very small advantage over CORI, in contrast to
its large advantage on topical TREC data. KL performs
substantially worse than CORI on the non-topical PTO data.

The corresponding distributions of relevant documents across
collections as ranked by KL and CORI (Figure 3 and Figure 4)
show that there is not much difference between KL and CORI in
the number of relevant documents seen in the first 10 collections.
This lack of a difference holds for PTO and TREC data, and in the
topical and non-topical conditions. However, if we had retrieved
documents from more than 10 collections, we would have seen
differences between CORI and KL in the numbers of relevant
documents available.

The distributions of relevant documents across collections in
Figure 3 and Figure 4 are difficult to interpret. For both
organizations, by topic and by source, the distributions show essentially
the same proportion of relevant documents in the top ranking 10
collections, whether they are ranked by CORI or by KL. We
cannot attribute the better performance of KL on topically organized
data to its choosing collections with more relevant documents.
Instead, KL somehow selects collections where the relevant
documents receive higher INQUERY scores. Similarly, on the TREC
data organized by source, KL selects collections with about the
same number of relevant documents as CORI, but these
documents receive lower scores, and hence lower ranks.

4.3  Document  List  Merging 

The picture is also complicated when we consider document list
merging. For the topical PTO data in Figure 5 we see large
differences in precision between merging algorithms. Global idf is
better than norm-both, which is better than norm-docs, which is
better than norm-dbs. When the PTO data are organized
chronologically, all the merging techniques yield the same precision
(Figure 6). This lack of difference is due to the fact that all the
chronological subcollections have very similar term statistics.
Therefore, document scores from single collection retrieval are
already normalized relative to each other, and further
normalization makes no difference.

The TREC results show much smaller differences among merging
algorithms than the PTO results show. When the organization is
by topic (Figure 7), global idf is better than all three
normalization methods, which are indistinguishable from one another.
When the organization is by source (Figure 8), global idf is only
slightly better than the other merging methods. In contrast to the
findings of Powell et al. [12], we find that global idf gives better
results than any normalization.

Taken  together,  the  PTO  and  TREC  results  show  that  for  topically 
organized  data,  global  idf  is  preferable  to  any  of  the  normalization 
methods  above.  This  result  is  contrary  to  the  claims  of  Powell,  et 
al.  that  by  normalizing  both  document  and  collection  scores  one 
can  attain  merging  performance  that  is  as  good  as  using  global  idf. 
The  key  factor  is  probably  the  degree  of  skew  in  the  term  fre¬ 
quency  distributions  of  the  different  collections.  The  PTO  divi¬ 
sion  by  class  is  extreme  in  that  term  frequencies  for  a  query  word 
can  vary  greatly  in  different  subcollections,  so  that  documents 
from  different  subcollections  can  have  extremely  disparate  scores. 
Normalization  is  not  sufficient  to  overcome  the  skewed  scores  for 
PTO.  However,  it  can  compensate  for  the  differences  among  less 
skewed  subdivisions.
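
The effect of skew can be illustrated with a small sketch. It does not use INQUERY's scoring function: the idf below is a plain logarithmic stand-in, and the subcollection names (`class_A`, `class_B`) and their document counts are hypothetical. It only shows how a per-collection idf gives the same query term very different weights in two topical subcollections, whereas a global idf gives it one weight everywhere.

```python
import math


def idf(num_docs, doc_freq):
    """Plain log-scaled idf (an assumed stand-in, not INQUERY's formula)."""
    return math.log((num_docs + 1) / (doc_freq + 0.5))


# Hypothetical statistics for ONE query term in two topical subcollections.
collections = {
    "class_A": {"num_docs": 10000, "doc_freq": 6000},  # term is common here
    "class_B": {"num_docs": 10000, "doc_freq": 20},    # term is rare here
}


def local_idfs(colls):
    """idf computed separately from each subcollection's own statistics."""
    return {name: idf(c["num_docs"], c["doc_freq"]) for name, c in colls.items()}


def global_idf(colls):
    """idf computed from statistics pooled over all subcollections."""
    total_docs = sum(c["num_docs"] for c in colls.values())
    total_df = sum(c["doc_freq"] for c in colls.values())
    return idf(total_docs, total_df)


if __name__ == "__main__":
    for name, value in local_idfs(collections).items():
        print(f"local idf in {name}: {value:.2f}")
    print(f"global idf: {global_idf(collections):.2f}")
```

With local statistics, a document from `class_B` receives a much larger weight for this term than an otherwise identical document from `class_A`, so the raw scores from the two subcollections are not directly comparable.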

5.  DISCUSSION 

5.1  Topical  Organization 

We  have  shown  superior  retrieval  from  collections  that  are  subdi¬ 
vided  along  topical  lines.  Division  of  patents  by  chronology,  in 
contrast,  produces  subcollections  that  cannot  be  distinguished 
from  one  another  statistically,  and  therefore  cannot  be  effectively
ranked  by  any  selection  algorithm.  A  TREC3  subdivision  by 
source  falls  between  a  topical  organization  and  a  chronological 
organization.  With  division  by  source,  similar  documents  are 
somewhat  concentrated  into  subcollections,  and  hence  there  is 
potential  for  retrieval  from  a  small  number  of  collections  to  be 
effective. 

In  our  experiments,  topical  organization  seemed  to  have  a  larger 
effect  with  PTO  data  than  with  the  TREC  data,  perhaps  because 
of  the  comparison  to  the  chronological  baseline,  which  is  less 
organized  than  TREC's  by-source  baseline.  There  is  more  going 
on,  however.  The  distributions  seem  to  show  more  concentration 
of  relevant  documents  into  fewer  subcollections  for  PTO  by  class 
than  for  TREC  by  topic.  It  is  possible,  however,  that  this  is  an 
artifact  of  our  judging  only  documents  that  were  retrieved  in  our 
experiments,  or  of  the  queries  being  particularly  aimed  at  one  or  a 
small  number  of  patent  classes.  Or  it  may  be  that  the  existing 
manual  patent  classification  system  is  a  better  organization  for 
patent  searching  than  global  clustering  is  for  TREC  queries. 
Figure  5.  PTO  by  class  precision  for  four  results  merging
algorithms


Figure  6.  PTO  chrono  precision  for  four  results  merging 
algorithms 
Figure  7.  TREC  by  topic  precision  for  results  merging
algorithms


Figure  8.  TREC  by  source  precision  for  results  merging 
algorithms 


5.2  Collection  Ranking 

The  comparison  of  CORI  with  KL  collection  ranking  methods 
confirmed  that  KL  is  clearly  better  than  CORI  when  the  subcol¬ 
lections  have  been  clustered  using  KL.  On  PTO  data,  where  the 
topics  are  based  on  human-designed  classes,  KL  shows  only  a 
very  small  gain,  if  any,  over  CORI  in  the  distribution  of  relevant 
documents  and  no  gain  in  precision.  However,  KL  gives  worse
results  than  CORI  when  collections  are  not  organized  by  topic,  as 
we  see  with  the  TREC  by-source  results  and  with  the  PTO 
chronological  results.  KL  is  effective  for  topical  organizations, 
but  should  not  be  used  when  collections  are  not  organized  topi¬ 
cally. 
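
As a rough illustration of KL-based collection ranking, the sketch below scores each collection by the Kullback-Leibler divergence between the query's term distribution and a smoothed collection language model, and ranks collections by increasing divergence. The smoothing scheme (linear interpolation with a pooled background model, weight `lam`) and all function and collection names are assumptions made for this example; they are not necessarily the formulation used in our experiments.

```python
import math
from collections import Counter


def kl_divergence(query_terms, coll_tf, coll_len, bg_tf, bg_len, lam=0.5):
    """KL(query || collection) over the query terms (lower is a better match).

    The collection model is interpolated with a background model so that
    terms missing from a collection do not yield a zero probability.
    """
    q_counts = Counter(query_terms)
    q_len = sum(q_counts.values())
    score = 0.0
    for term, count in q_counts.items():
        p_q = count / q_len
        # Additive smoothing keeps the background probability above zero.
        p_bg = (bg_tf.get(term, 0) + 0.5) / (bg_len + 1.0)
        p_c = lam * coll_tf.get(term, 0) / max(coll_len, 1) + (1 - lam) * p_bg
        score += p_q * math.log(p_q / p_c)
    return score


def rank_collections(query_terms, collections, lam=0.5):
    """collections maps a name to a Counter of term frequencies."""
    background = Counter()
    for tf in collections.values():
        background.update(tf)
    bg_len = sum(background.values())
    scored = [
        (kl_divergence(query_terms, tf, sum(tf.values()), background, bg_len, lam), name)
        for name, tf in collections.items()
    ]
    return [name for _, name in sorted(scored)]


if __name__ == "__main__":
    # Hypothetical topical subcollections.
    colls = {
        "optics": Counter({"laser": 50, "lens": 30, "printer": 5}),
        "chemistry": Counter({"acid": 60, "polymer": 40, "laser": 1}),
    }
    print(rank_collections(["laser", "printer"], colls))
```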

5.3  Results  Merging 

We  have  shown  that  for  results  merging,  none  of  the  three  nor¬ 
malization  methods  works  as  well  as  global  idf,  for  both  PTO  and 
TREC  data  sets.  We  found  big  differences  among  the  normaliza¬
tion  methods  on  the  PTO  data.  It  is  more  effective  to  normalize
both  collection  and  document  scores  and  combine  them  than  it  is
to  normalize  either  score  alone.  However,  in  contrast  to  Powell
et  al.'s  results,  none  of  these  versions  of  normalization  performs
as  well  as  using  global  idf,  probably  because  the  term  distributions
are  so  skewed.

5.4  Implications 

The  results  of  this  study  suggest  that  the  best  way  to  implement 
the  distributed  patent  search  system  is  to  divide  up  the  collection 
by  patent  class,  to  use  CORI  or  KL  for  collection  ranking,  and  to 
use  global  idf  for  merging. 

This  pattern  of  results  has  some  bearing  upon  how  one  might  want 
to  merge  results  lists  in  the  case  of  retrieval  from  disparate  pro¬ 
viders  when  one  cannot  control  (or  even  know)  how  document 
scores  are  computed,  or  in  the  worst  case,  when  the  provider  re¬
turns  no  document  scores  at  all.  One  could  compute  INQUERY-style
document  scores  for  the  top  n  documents  on  each  results  list
using  just  the  text  of  the  documents  and  the  collection-wide  fre¬
quency  information  available  in  the  collection-selection  database,
which  was  either  obtained  by  cooperation  from  providers  or  esti¬
mated  by  sampling.  The  tf  part  of  the  tf-idf  score  could  be  derived
by  parsing  the  documents  and  counting  occurrences  of  query
words  in  the  documents.  The  idf  component  is  a  simple  function
of  the  frequency  information  in  the  collection.  It  would  require 
very  high  bandwidth  to  get  the  text  of  all  the  documents  to  be 
ranked,  but  as  connections  get  faster  this  will  be  possible. 
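
The rescoring idea above might look roughly like the following sketch. It is a simplified tf-idf, not INQUERY's actual belief function, and every name in it (`tokenize`, `rescore`, `merge`, the provider and document identifiers) is hypothetical. The document frequencies `doc_freq` and the total `num_docs` are assumed to come from the collection-selection database, whether supplied by providers or estimated by sampling.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rescore(query, doc_text, doc_freq, num_docs):
    """Score one document from its text plus collection-wide frequencies."""
    counts = Counter(tokenize(doc_text))
    doc_len = sum(counts.values()) or 1
    score = 0.0
    for term in tokenize(query):
        tf = counts[term] / doc_len  # length-normalized term frequency
        idf = math.log((num_docs + 1) / (doc_freq.get(term, 0) + 0.5))
        score += tf * idf
    return score


def merge(query, result_lists, doc_freq, num_docs, n=10):
    """Rescore the top n documents of each provider and merge into one list.

    result_lists maps a provider name to a ranked list of (doc_id, text).
    Returns (score, provider, doc_id) tuples, best first.
    """
    rescored = []
    for provider, docs in result_lists.items():
        for doc_id, text in docs[:n]:
            rescored.append((rescore(query, text, doc_freq, num_docs), provider, doc_id))
    rescored.sort(key=lambda item: item[0], reverse=True)
    return rescored


if __name__ == "__main__":
    results = {
        "provider_a": [("a1", "laser printer toner cartridge"), ("a2", "garden hose")],
        "provider_b": [("b1", "laser laser printer")],
    }
    freqs = {"laser": 50, "printer": 200}  # hypothetical global document frequencies
    for score, provider, doc_id in merge("laser printer", results, freqs, 10000):
        print(f"{doc_id} ({provider}): {score:.3f}")
```

Because every document is scored with the same frequency information, the merged scores are comparable across providers regardless of how, or whether, each provider computed its own scores.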